1 Defining Expected Points

In this notebook I develop and explore an expected points model at the play level for evaluating college football offenses and defenses. The goal of this analysis is to place a value on offensive/defensive plays in terms of their contribution to a team’s expected points.

The data used is from college football games from 2003 to present. Each observation represents one play in a game, in which we know the team, the situation (down, time remaining), and the location on the field (yards to go, yards to reach end zone). We have information about the types of plays called as well in a text field.

1.1 Sequences of Play

For each play in a game, I model the probability of the next scoring event that will occur within the same half. This means the analysis is not at the drive level, but at what I dub the sequence level. Suppose a team has the ball on offense to start the first half. The next scoring event can take on one of seven outcomes:

  • Touchdown (7 points)
  • Field goal (3 points)
  • Safety (2 points)
  • No Score (0 points)
  • Opp safety (-2 points)
  • Opp field goal (-3 points)
  • Opp touchdown (-7 points)

If the team on offense drives down and scores a TD/FG, this will end the sequence. If the team on offense does not score but punts or turns the ball over, the sequence will continue with the other team now on offense. The sequence will continue until either one team scores or the half comes to an end. In other words, a sequence begins at a kickoff and ends at the next kickoff or at the end of the half.

Suppose we have two teams, A and B, playing in a game. Team A receives the opening kickoff, drives for a few plays, and then punts. Team B takes over, which starts drive 2, and they drive for a few plays before also punting. Team A then manages to put together a drive that finally scores.

All plays on these three drives are one sequence. The outcome of this sequence is the points scored by Team A: if they score a touchdown, their points from this sequence are 7 (assuming for now they make the extra point), and Team B’s points from this sequence are -7.

When Team A kicks off to Team B to start drive 4, we start our next sequence, which will end either with one team scoring or at the end of the half. We’ll then start over with a new sequence in the second half.
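To make the bookkeeping concrete, here is a minimal sketch of labelling sequences in a toy play table that mirrors the Team A / Team B example. The column names (`half`, `drive`, `offense`, `points_scored`) and the data are hypothetical, not the notebook's actual schema, and defensive scores are ignored for simplicity. The only idea illustrated is that a new sequence opens at the start of a half or right after a score.

```r
# Toy play-by-play: Team A punts, Team B punts, Team A scores a TD,
# then Team B has the ball when the half ends without another score.
plays <- data.frame(
  half          = c(1, 1, 1, 1, 1, 1, 1, 1),
  drive         = c(1, 1, 2, 2, 3, 3, 4, 4),
  offense       = c("A", "A", "B", "B", "A", "A", "B", "B"),
  points_scored = c(0, 0, 0, 0, 0, 7, 0, 0),  # points scored by the offense on the play
  stringsAsFactors = FALSE
)

# A play opens a new sequence if it is the first play of a half
# or if the previous play was a score.
n <- nrow(plays)
prev_scored <- c(FALSE, plays$points_scored[-n] != 0)
new_half    <- c(TRUE, plays$half[-1] != plays$half[-n])
plays$sequence <- cumsum(prev_scored | new_half)

# Attach the sequence outcome to every play, from the offense's perspective.
plays$next_score_offense <- NA_real_
for (s in unique(plays$sequence)) {
  idx   <- which(plays$sequence == s)
  score <- idx[plays$points_scored[idx] != 0]
  if (length(score) == 0) {
    plays$next_score_offense[idx] <- 0  # half ended with no score
  } else {
    scorer <- plays$offense[score[1]]
    pts    <- plays$points_scored[score[1]]
    # Positive for the team that scores, negative for the other team.
    plays$next_score_offense[idx] <- ifelse(plays$offense[idx] == scorer, pts, -pts)
  }
}

plays[, c("drive", "offense", "sequence", "next_score_offense")]
```

Drives 1 through 3 share one sequence, so Team B's plays on drive 2 carry a value of -7 even though Team B never gave up the ball on a turnover.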

Why model the outcome of sequences rather than individual drives? Individual plays have the potential to affect both teams’ chances of scoring, positively or negatively, and we want our model to directly capture this. If an offense turns the ball over at midfield, they are not only hurting their own chances of scoring, they are increasing the other team’s chance of scoring. The value of a play in terms of expected points is a function of how both teams’ probabilities are affected by the outcome.

1.2 Defining Expected Points

A team’s expected points is the sum of the probability of each possible scoring event multiplied by the points of that event. For this analysis, I treat touchdowns as worth 7 rather than 6 points, assuming that extra points will be made. I can later bake in the actual probability of making extra points, but this will be a simplification for now.

For a given play \(i\) for Team \(A\), we can compute Team A’s expected points using the following:

\[ \text{Expected Points}_A = \Pr(\text{TD}) \cdot 7 + \Pr(\text{FG}) \cdot 3 + \Pr(\text{Safety}) \cdot 2 + \Pr(\text{No Score}) \cdot 0 + \Pr(\text{Opp. Safety}) \cdot (-2) + \Pr(\text{Opp. FG}) \cdot (-3) + \Pr(\text{Opp. TD}) \cdot (-7) \]
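As a quick worked example of the formula above, the sketch below plugs in a set of made-up probabilities for a single play. The numbers are illustrative only and are not model output.

```r
# Point values for the seven possible next scoring events,
# from the perspective of the team on offense.
points <- c(TD = 7, FG = 3, Safety = 2, No_Score = 0,
            Opp_Safety = -2, Opp_FG = -3, Opp_TD = -7)

# Hypothetical probabilities for one play (illustrative only).
probs <- c(TD = 0.30, FG = 0.15, Safety = 0.01, No_Score = 0.20,
           Opp_Safety = 0.01, Opp_FG = 0.10, Opp_TD = 0.23)
stopifnot(isTRUE(all.equal(sum(probs), 1)))

# Expected points: probability-weighted sum of the point values.
expected_points <- sum(probs[names(points)] * points)
expected_points
```

With these numbers the offense's expected points come out to 0.64: the 2.57 points from its own scoring chances are mostly offset by the 1.93 points it expects to give up.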

How do we get the probabilities of each scoring event? We learn these from historical data by using a model: I train a multinomial logistic regression model on many seasons’ worth of college football plays to learn how situations on the field affect the probability of the next scoring event.

1.3 Next Scoring Event

The outcome for our analysis is the NEXT_SCORE_EVENT. Each play in a given sequence contributes to the eventual outcome of the sequence. Here we can see an example of one game and its drives:

For this game, we can filter to the plays that took place in the lead-up to the first scoring event. In this case, the first sequence included one drive and ended when Texas A&M kicked a field goal.

If we look at another sequence in the second half, there were multiple drives before a team was able to score in that sequence. The next scoring event is always defined from the perspective of the offense.

2 Modeling Expected Points

Our goal is to understand how individual plays contribute to a team’s expected points, or the average points teams should expect to have given their situation (down, time, possession).

For instance, in the first drive of the Texas A&M-Florida game in 2012, Texas A&M received the ball at their own 25 yard line to open the game. The simplest intuition of expected points is to ask, for teams starting at the 25 yard line at the beginning of a game, how many points do they typically go on to score? The answer is to look at all starting drives with 75 yards to go and see what the eventual next scoring event was for each of these plays - we take the average of all of the points that followed from this situation.

In this case, this means teams with the ball at their own 25 to start the game generally obtained more points on the ensuing sequence than their opponents, so they have a slightly positive expected points.
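The "average what happened next" intuition can be sketched in a few lines. The data frame below is entirely made up, and the column names (`yards_to_goal`, `down`, `next_points`) are placeholders rather than the notebook's fields.

```r
# Hypothetical plays: yards to goal, down, and the points from the
# next scoring event (from the offense's perspective).
toy <- data.frame(
  yards_to_goal = c(75, 75, 75, 75, 75, 40, 40),
  down          = c(1, 1, 1, 1, 1, 1, 1),
  next_points   = c(7, -7, 3, 0, 7, 7, 3)
)

# Empirical expected points: average next-score points per situation.
ep_empirical <- aggregate(next_points ~ yards_to_goal + down, data = toy, FUN = mean)
ep_empirical
```

Raw averages like this get noisy quickly once situations are sliced by down, distance, and time, which is one reason to smooth them with a model instead.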

But this is also a function of the down. If we compare the expected points for a team in this situation on first down with a team in this situation on fourth down, we should see a drop in expected points: by the time you hit fourth down, if you haven’t moved from the 25, your expected points drop into the negatives, as you will now be punting the ball back to your opponent and it becomes more probable that they score than you.

The fact that expected points change with the down and yard line allows us to look at the difference in expected points from play to play. That difference, driven by how the situation changed, is the Expected Points Added from a single play.

For any given play, we get a sense of the points a team can expect from their situation. For instance, if we look at all plays in a game, how do expected points vary as a function of a team’s distance from their opponent’s goal line?

This should make sense: if you’re backed up against your own end zone, your opponent has higher expected points because they are, historically, more likely to have the next scoring event, either by gaining good field position after you punt or by getting a safety. We can see this if we just look at the proportion of next scoring events based on the offense’s position on the field.

From this, when we see an offense move the ball up the field on a given play, we will generally see their expected points go up. The difference in expected points before the snap and after the snap is the value added (positively or negatively) by the play.

But it’s not just position on the field; it’s also about the situation. If we look at how expected points vary by down, we should see that fourth downs have lower expected points.

We also have other features like distance to convert the first down (filtering here to plays with a maximum of 30 yards to go, as we start to run out of data at higher values and it looks wonky).

And we also have info on time remaining in the half - as we might expect, the proportion of drives leading to no scoring goes up as the amount of time remaining in the half goes down.

We use all of this historical data to learn the expected points from a given situation, then look at the difference in expected points from play to play - this is the intuition behind how we will value individual plays, which we can then roll up to the offense/defense/game/season level.

2.1 Building Models

How do these various features like down, distance, yards to goal, and time remaining affect the probability of the next scoring event? We use a model to learn this relationship from historical plays. I’ll now proceed to building the model which I’ll use for the bulk of the analysis.

I’ll set up training, validation, and test sets based around the season. I’m mostly going to build the model using plays from the 2007 season onwards, as the data quality of the play by play data starts to get worse the further back we go, though I’ll do some backtesting of the model on older seasons.

I’m going to use the seasons of 2007-2018 as my main training set, building and evaluating the model using a leave-one-season out approach, akin to k-fold cross validation using seasons as the folds. I’ll use the 2019-2020 seasons as a validation set, and leave 2021 as my test set which I won’t look at till later on.
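The leave-one-season-out idea can be sketched without any modeling packages: each fold holds out one season and trains on the rest. The season labels below are hypothetical, and this is not necessarily how the notebook's own resamples were built; in tidymodels, `rsample::group_vfold_cv()` with `group = SEASON` produces an equivalent set of folds.

```r
# Hypothetical season labels for a handful of plays.
seasons <- c(2007, 2007, 2008, 2008, 2008, 2009, 2009)

# One fold per season: hold that season out, train on the rest.
folds <- lapply(sort(unique(seasons)), function(s) {
  list(season     = s,
       analysis   = which(seasons != s),   # rows used for fitting
       assessment = which(seasons == s))   # rows used for evaluation
})

# Each play is assessed exactly once across the folds.
held_out <- unlist(lapply(folds, `[[`, "assessment"))
```

Splitting by season rather than by random rows keeps plays from the same game and the same team-season out of both sides of a fold.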

# full plays
plays_full = plays_data_score_events %>%
        filter(PLAY_TYPE != 'Kickoff') %>%
        arrange(SEASON, GAME_ID) %>%
        # group_by(DEFENSE) %>%
        # mutate(DEFENSE_PLAY_NUMBER = row_number()) %>%
        # group_by(OFFENSE) %>%
        # mutate(OFFENSE_PLAY_NUMBER = row_number()) %>%
        # ungroup() %>%
        # group_by(OFFENSE, GAME_ID) %>%
        # mutate(OFFENSE_GAME_NUMBER = row_number()) %>%
        # group_by(DEFENSE, GAME_ID) %>%
        # mutate(DEFENSE_GAME_NUMBER = row_number()) %>%
        ungroup() %>%
        select(GAME_ID,
               DRIVE_ID,
               PLAY_ID,
               SEASON,
               HOME,
               AWAY,
               OFFENSE,
               DEFENSE,
               OFFENSE_CONFERENCE,
               DEFENSE_CONFERENCE,
               OFFENSE_SCORE,
               DEFENSE_SCORE,
               # OFFENSE_PLAY_NUMBER,
               # DEFENSE_PLAY_NUMBER,
               SCORING,
               PLAY_TEXT,
               PLAY_TYPE,
               NEXT_SCORE_EVENT_HOME,
               NEXT_SCORE_EVENT_HOME_DIFF,
               NEXT_SCORE_EVENT_OFFENSE,
               NEXT_SCORE_EVENT_OFFENSE_DIFF,
               YARD_LINE,
               HALF,
               PERIOD,
               MINUTES_IN_HALF,
               SECONDS_IN_HALF,
               DOWN,
               DISTANCE,
               YARDS_TO_GOAL) %>%
        filter(DOWN %in% c(1, 2, 3, 4)) %>%
        filter(PERIOD %in% c(1,2,3,4)) %>%
        filter(!is.na(SECONDS_IN_HALF)) %>%
        filter(DISTANCE >=0 & DISTANCE <=100) %>%
        filter(!is.na(NEXT_SCORE_EVENT_OFFENSE)) %>%
        mutate(NEXT_SCORE_EVENT_OFFENSE = factor(NEXT_SCORE_EVENT_OFFENSE,
                                                 levels = c("No_Score",
                                                            "TD",
                                                            "FG",
                                                            "Safety",
                                                            "Opp_Safety",
                                                            "Opp_FG",
                                                            "Opp_TD"))) %>%
        arrange(SEASON, GAME_ID, PLAY_ID)

# training set
plays_train = plays_full %>%
        filter(SEASON >= 2007 & SEASON <2019)

# validation set
plays_valid = plays_full %>%
        filter(SEASON >= 2019 & SEASON <= 2020)

# test
plays_test = plays_full %>%
        filter(SEASON > 2020)

# make an initial split based on previously defined splits
valid_split = make_splits(list(analysis = seq(nrow(plays_train)),
                                 assessment = nrow(plays_train) + seq(nrow(plays_valid))),
                               bind_rows(plays_train,
                                         plays_valid))

# test split
test_split = make_splits(
        list(analysis = seq(nrow(plays_train) + nrow(plays_valid)),
             assessment = nrow(plays_train) + nrow(plays_valid) + seq(nrow(plays_test))),
        bind_rows(plays_train,
                  plays_valid,
                  plays_test))

The outcome is the next scoring event, always defined from the perspective of the offense for any given play.

2.1.1 Baseline

I currently use the following as features for plays in a baseline model:

  • Quarter
  • Seconds Remaining in Half
  • Down
  • Distance (logged)
  • Yards to opponent’s end zone
  • Down and goal indicator for whether the offense is in a ‘first and goal’ situation

I also include interactions between down and distance, down and yards to end zone, and yards to end zone and seconds remaining. This baseline model doesn’t account for things like offense/defense quality or scoring effects, meaning this analysis focuses on estimating the expected points given the situation, without respect to opponent.

baseline_recipe = recipe(NEXT_SCORE_EVENT_OFFENSE ~.,
                         data = plays_train) %>%
        update_role(all_predictors(),
                    new_role = "ID") %>%
        update_role(
                c("GAME_ID",
                  "DRIVE_ID",
                  "PLAY_ID",
                  "SEASON",
                  "HOME",
                  "AWAY",
                  "OFFENSE",
                  "DEFENSE",
                  "OFFENSE_CONFERENCE",
                  "DEFENSE_CONFERENCE",
                  "SCORING",
                  "OFFENSE_SCORE",
                  "DEFENSE_SCORE",
                  "PLAY_TEXT",
                  "PLAY_TYPE",
                  "NEXT_SCORE_EVENT_HOME",
                  "NEXT_SCORE_EVENT_HOME_DIFF",
                  "NEXT_SCORE_EVENT_OFFENSE_DIFF",
                  "YARD_LINE",
                  "MINUTES_IN_HALF",
                  "HALF"),
                new_role = "ID") %>%
        step_mutate(PERIOD_ID = PERIOD,
                    role = "ID") %>%
        # features we're inheriting
        update_role(
                c("PERIOD", 
                "SECONDS_IN_HALF",
                "DOWN",
                "DISTANCE",
                "YARDS_TO_GOAL"),
                new_role = "predictor") %>%
        # filters for issues
        step_filter(!is.na(NEXT_SCORE_EVENT_OFFENSE)) %>%
        step_filter(YARD_LINE <= 100 & YARD_LINE >=0) %>%
        step_filter(YARDS_TO_GOAL <=100 & YARDS_TO_GOAL >=0) %>%
        step_filter(DOWN %in% c(1, 2, 3, 4)) %>%
        step_filter(DISTANCE >=0 & DISTANCE <=100) %>%
        step_filter(SECONDS_IN_HALF <=1800) %>%
        step_filter(!is.na(SECONDS_IN_HALF)) %>%
        step_filter(PERIOD_ID == 1 | PERIOD_ID == 2 | PERIOD_ID == 3 | PERIOD_ID == 4) %>%
        # create features
        step_mutate(KICKOFF = case_when(grepl("kickoff", tolower(PLAY_TEXT)) | grepl("kickoff", tolower(PLAY_TYPE))==T ~ 1,
                                        TRUE ~ 0)) %>%
        step_mutate(TIMEOUT = case_when(grepl("timeout", tolower(PLAY_TEXT)) ~ 1,
                                        TRUE ~ 0)) %>%
        step_filter(TIMEOUT != 1) %>%
        step_filter(KICKOFF != 1) %>%
        step_mutate(DOWN_TO_GOAL = case_when(DISTANCE == YARDS_TO_GOAL ~ 1,
                                             TRUE ~ 0)) %>%
        step_mutate(DOWN = factor(DOWN)) %>%
        step_mutate(PERIOD = factor(PERIOD)) %>%
        step_log(DISTANCE, offset =1) %>%
        step_dummy(all_nominal_predictors()) %>%
        step_novel(all_nominal_predictors(),
                   new_level = "new") %>%
        step_interact(terms = ~ DISTANCE:(starts_with("DOWN_"))) %>%
        step_interact(terms = ~ YARDS_TO_GOAL:(starts_with("DOWN_"))) %>%
        step_interact(terms = ~ YARDS_TO_GOAL*SECONDS_IN_HALF) %>%
        check_missing(all_predictors()) %>%
        step_normalize(all_numeric_predictors())
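For readers less familiar with recipes, the engineered features described in the list above can be approximated in base R. This is a rough sketch on made-up rows, not a reproduction of the recipe: it skips the filters and the normalization step, and `LOG_DISTANCE` is a name introduced here for illustration.

```r
# Hypothetical plays; column names mirror the notebook's but the values are made up.
toy <- data.frame(
  PERIOD          = factor(c(1, 1, 2, 2, 3), levels = 1:4),
  DOWN            = factor(c(1, 2, 3, 4, 1), levels = 1:4),
  DISTANCE        = c(10, 7, 3, 1, 5),
  YARDS_TO_GOAL   = c(75, 72, 40, 1, 5),
  SECONDS_IN_HALF = c(1800, 1760, 900, 30, 600)
)

# Goal-to-go indicator: distance to the first down equals distance to the end zone.
toy$DOWN_TO_GOAL <- as.integer(toy$DISTANCE == toy$YARDS_TO_GOAL)

# Log distance with an offset of 1, so DISTANCE = 0 maps to 0.
toy$LOG_DISTANCE <- log(toy$DISTANCE + 1)

# Main effects plus the three interactions described in the text.
X <- model.matrix(
  ~ PERIOD + DOWN + LOG_DISTANCE + YARDS_TO_GOAL + SECONDS_IN_HALF + DOWN_TO_GOAL +
    DOWN:LOG_DISTANCE + DOWN:YARDS_TO_GOAL + YARDS_TO_GOAL:SECONDS_IN_HALF,
  data = toy
)
colnames(X)
```

Inspecting the column names is a cheap way to confirm which interaction terms actually reach the model.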

2.1.2 Defense Fixed Effects

Not all defenses are created equal. When evaluating offenses, we’ll want to account for the quality of the defense a team is facing. I’ll handle this by making a recipe with a fixed effect for the defense that the offense is facing.

2.1.3 Offense Fixed Effect

I’ll also make a specification with a fixed effect for the offense that the defense is facing, so we can flip the analysis around when evaluating defenses.

2.2 Workflows

I’ll define the model I’ll be using here, which is a multinomial logistic regression.

# from glmnet
multinom_mod = multinom_reg(
  mode = "classification",
  engine = "glmnet",
  penalty = 0,
  mixture = NULL
)

I’ll then create workflows for each.

# create baseline
baseline_wf = workflow() %>%
        add_recipe(baseline_recipe) %>%
        add_model(multinom_mod)

# defense adjusted
defense_adjusted_wf = workflow() %>%
        add_recipe(defense_adjusted_recipe) %>%
        add_model(multinom_mod)

# offense adjusted
offense_adjusted_wf = workflow() %>%
        add_recipe(offense_adjusted_recipe) %>%
        add_model(multinom_mod)

# workflow settings
# metrics
class_metrics<-metric_set(yardstick::roc_auc,
                          yardstick::mn_log_loss)

# control for resamples
keep_pred <- control_resamples(save_pred = TRUE, 
                               save_workflow = TRUE,
                               allow_par=T)

At this point I can remove the recipes, as they’re now embedded in the workflows.

2.3 Training

# # fit to resamples
# resamples_multinom = multinom_wf %>%
#         fit_resamples(data = plays_train,
#                       metrics = class_metrics,
#                       resamples = manual_resamples,
#                       control = keep_pred,
#                       verbose=T)

# # # save locally so as to not need to retrain everytime
# write_rds(resamples_multinom,
#      file = here::here("models", "resamples_expected_points.Rds"),
#      compress = "gz")
# fit the model to the whole training set
fit_baseline = baseline_wf %>%
        fit(data = plays_train)

# fit the defense adjusted model
fit_defense_adjusted = defense_adjusted_wf %>%
        fit(data = plays_train)

# fit the offense adjusted model
fit_offense_adjusted = offense_adjusted_wf %>%
        fit(data = plays_train)

2.4 Inference

2.4.1 Predicted Probabilities and Partial Effects

Understanding partial effects from a multinomial logit is already difficult, and I’ve thrown a bunch of interactions in there to make this even more difficult.

I’ll look at predicted probabilities using an observed values approach for particular features (using a sample rather than the full dataset to save time). This means taking the model and then altering the feature of interest for every observation and taking the average predicted probability for each outcome across all observations.
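The observed-values approach can be sketched on simulated data. To keep the example self-contained, it uses `nnet::multinom` as a stand-in for the glmnet-based workflow in this notebook; the data, the three-outcome simplification, and the helper `avg_probs_at()` are all assumptions made for illustration.

```r
library(nnet)  # recommended package shipped with standard R installs

set.seed(1)
# Simulated stand-in data; TD probability falls as yards to goal grows.
n <- 600
sim <- data.frame(
  yards_to_goal = sample(1:99, n, replace = TRUE),
  down          = factor(sample(1:4, n, replace = TRUE))
)
p_td <- plogis(1 - 0.04 * sim$yards_to_goal)
sim$outcome <- factor(ifelse(runif(n) < p_td, "TD",
                      ifelse(runif(n) < 0.5, "No_Score", "Opp_TD")),
                      levels = c("No_Score", "TD", "Opp_TD"))

fit <- multinom(outcome ~ yards_to_goal + down, data = sim, trace = FALSE)

# Observed-values approach: set the feature of interest to a fixed value for
# every row, predict, then average the predicted probabilities.
avg_probs_at <- function(model, data, feature, value) {
  data[[feature]] <- value
  colMeans(predict(model, newdata = data, type = "probs"))
}

grid <- c(10, 50, 90)
result <- t(sapply(grid, function(v) avg_probs_at(fit, sim, "yards_to_goal", v)))
rownames(result) <- paste0("ytg_", grid)
round(result, 3)
```

Every other feature keeps its observed value, so the averages reflect the mix of downs actually seen in the data rather than a single "typical" play.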

How is the probability of the next scoring event influenced by where the offense has possession?

How is this affected by the down?

How does this translate into expected points?

2.4.2 Team Effects

For the models that include team effects, we end up with a coefficient for the opponent’s offense or defense that affects the amount of points an offense or defense can expect when playing a team.

This means, for instance, that when facing a team like Alabama, your expected points are lower regardless of where you are on the field due to the strength of their defense. If you’re facing a weaker team, like Idaho, your expected points are higher.

I interacted team effects with season, which means this effect will vary for each team by the season.
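One way a team-by-season effect could be encoded is as a single factor with one level per team-season pair. The sketch below shows that encoding in base R on made-up rows; the team-effect recipes themselves are not shown in this writeup, so treat this as an assumption about the general approach rather than the actual implementation. `DEFENSE_SEASON` is a hypothetical column name.

```r
toy <- data.frame(
  DEFENSE = c("Alabama", "Alabama", "Idaho", "Idaho", "Alabama"),
  SEASON  = c(2017, 2018, 2017, 2018, 2017),
  stringsAsFactors = FALSE
)

# One level per defense-season combination.
toy$DEFENSE_SEASON <- interaction(toy$DEFENSE, toy$SEASON, sep = "_", drop = TRUE)

# Indicator columns (no intercept, so every level gets its own column).
D <- model.matrix(~ 0 + DEFENSE_SEASON, data = toy)
colnames(D) <- levels(toy$DEFENSE_SEASON)
D
```

Each indicator column then gets its own coefficient per outcome, which is what lets the 2017 and 2018 versions of the same defense differ.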

2.5 Validation Set

We can evaluate the model via a leave-one-season-out approach, or via some in-sample metrics of fit, but I’ll predict the validation set as a check. I’ll compare performance relative to a null model that simply predicts the incidence rate of each outcome in the training set.
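The comparison against a null model can be sketched as follows. The outcomes and the "model" probabilities are made up, only three outcome classes are used to keep it short, and `log_loss()` is a small helper written here for illustration (the notebook itself uses `yardstick::mn_log_loss`).

```r
# Multiclass log loss: mean negative log probability assigned to the true class.
log_loss <- function(truth, prob) {
  idx <- cbind(seq_along(truth), match(as.character(truth), colnames(prob)))
  -mean(log(pmax(prob[idx], 1e-15)))
}

outcomes <- c("No_Score", "TD", "FG")
train_y <- factor(c("TD", "TD", "FG", "No_Score", "TD", "FG"), levels = outcomes)
valid_y <- factor(c("TD", "FG", "TD", "No_Score"), levels = outcomes)

# Null model: every validation play gets the training-set incidence rates.
rates <- prop.table(table(train_y))
null_prob <- matrix(rep(as.numeric(rates), each = length(valid_y)),
                    nrow = length(valid_y),
                    dimnames = list(NULL, names(rates)))

# Hypothetical model predictions for the same four validation plays.
model_prob <- rbind(c(0.2, 0.6, 0.2),
                    c(0.2, 0.3, 0.5),
                    c(0.1, 0.7, 0.2),
                    c(0.6, 0.2, 0.2))
colnames(model_prob) <- outcomes

null_loss  <- log_loss(valid_y, null_prob)
model_loss <- log_loss(valid_y, model_prob)
c(null = null_loss, model = model_loss)
```

A model is only useful here if its log loss on the validation seasons is below the null model's.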

What’s the log loss for each outcome?

2.6 Scoring Plays

I’ll now start diving into the predictions for individual plays as a means to evaluate plays and teams.

It’s worth noting that we might see some season-level differences that make comparison across seasons difficult, since the predictions are all coming from slightly different models due to resampling.

I’ll get Expected Points Added for all non-scoring plays. This part can be tricky, due to data quality issues in both the reported yards to goal and in defining sequences. The basic thought here is to say, at the start of a play, we know the expected points for a team in that situation, EP_Pre. We then look to the next play to see the expected points for the team after the result of the previous play, EP_Post. EP_Added is the difference between these two values from the perspective of the offense.

This means that if the ball is turned over without a score, the team on offense becomes the defense and the sign of the expected points on the next play flips for their calculation. For scoring plays, EP_Added is empty, but I create another feature simply called Points_Added in which I take the difference between EP_Pre and the points scored on the play: 7 for touchdowns, 3 for FGs, 2 for safeties.
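The EP_Pre / EP_Post logic, including the sign flip on a change of possession, can be sketched on a handful of made-up plays. The `ep` values are hypothetical rather than model output, the direction of Points_Added (points scored minus EP_Pre) is my reading of the description above, and ends of halves are ignored for brevity.

```r
# Hypothetical consecutive plays; `ep` is the offense's expected points before
# the snap and `points` is what the offense scored on the play.
plays <- data.frame(
  offense = c("A", "A", "A", "B", "B"),
  ep      = c(1.0, 1.8, 0.9, 2.5, 4.0),
  points  = c(0, 0, 0, 0, 7),
  stringsAsFactors = FALSE
)
# Play 3 ends in a turnover (B has the ball on play 4); play 5 is a touchdown.

ep_pre <- plays$ep

# EP on the following play, re-expressed for the current offense:
# flip the sign when possession has changed.
next_ep      <- c(plays$ep[-1], NA)
next_offense <- c(plays$offense[-1], NA)
ep_post <- ifelse(next_offense == plays$offense, next_ep, -next_ep)

# Non-scoring plays get EP_Added; scoring plays get Points_Added instead.
scoring <- plays$points != 0
plays$EP_Added     <- ifelse(scoring, NA, ep_post - ep_pre)
plays$Points_Added <- ifelse(scoring, plays$points - ep_pre, NA)
plays
```

The turnover on play 3 costs Team A 3.4 expected points: the 0.9 they held before the snap plus the 2.5 their opponent now holds.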

End this writeup here. Move to another write up.